This report was prepared for ETC5521 Assignment 1 by Team numbat, comprising Aarathy Babu, Lachlan Moody, Dilinie Seimon, and Jinhao Luo.

1 Introduction and motivation

2020 was a bad year for passwords. A recent audit of the ‘dark web’, reported on by Forbes, revealed that over 15 billion stolen logins were circulating online (Winder, 2020). As the article notes, this represents two sets of account logins for every person on the planet.

This was the result of more than 100,000 data breaches relating to cybercrime activities, a 300% increase since 2018. In an age where everybody leaves an ever-growing digital record of their activities, from social media to banking, what can the average person do to bolster their security online?

The following analysis explores this issue using a compilation of some of the most commonly used passwords on the web. It should be noted, however, that the original data was compiled in September 2014, so the trends and findings discussed below may not be entirely applicable today; a more up-to-date collection would be required to ensure full relevance. Nevertheless, it is reasonable to assume that the underlying foundations of password security have not changed much in the past few years. Additionally, the strength rating provided is calculated relative to the other passwords in the data set. As the provided documentation explains, because these common passwords are mostly ‘bad’, a high strength rating does not necessarily indicate that a password is hard to crack. The data does, however, include additional variables (the online and offline times to crack) that allow this to be assessed directly. Detailed information on the data used and the research questions formulated is provided in the following section.

2 Data description

Based on the motivations discussed above, the following research questions were formulated. The primary subject of interest was:

What are the characteristics of the most common passwords in the interest of security?

Once this area of exploration was established, three questions were composed to frame the analysis that follows. They were:

  1. What are the common trends among the most commonly used passwords?
  2. How strong are the common passwords?
  3. Is high strength related to longer password crack time?

To address these questions, data was sourced from Information is Beautiful (2014). It contained information on 507 passwords derived from the online databases SkullSecurity and DigiNinja, collected in 2014. The data was provided in a tidy format and was read into RStudio as a CSV file directly from the GitHub repository provided by Tidy Tuesday (2020) using the readr (2018) package. The data contained the following variables:

  • rank: popularity of password
  • password: actual text of the password
  • category: password type category
  • value: time to crack password by online guessing
  • time_unit: unit of time for corresponding value
  • offline_crack_sec: time to crack offline in seconds
  • rank_alt: alternative value for rank (same value as rank in all cases)
  • strength: relative strength of password from 1 to 10
  • font_size: used externally to create graphic for Knowledge is Beautiful (2014)

A visualisation of the data structure, produced using the visdat (2017) package, can be seen below in Figure 2.1.

Figure 2.1: Initial Data Structure

Figure 2.1 highlighted two areas where the data needed to be altered. Firstly, the variable category was recoded from a character to a factor, as it is a categorical variable. Secondly, there appeared to be some missing observations in the dataset. This was examined further in Figure 2.2, produced using the naniar (2020) package, which showed that these missing values were spread evenly across the variables and confined to the tail end of the data set.

Figure 2.2: Missing Data Values

On further investigation, there appeared to be 7 blank rows at the end of the dataset. These observations were removed using dplyr (2020), as they provided no tangible value and may have negatively impacted the subsequent analysis. The resulting data frame had 500 observations of 9 variables.
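The row-removal step can be illustrated with a minimal sketch. The report performed this step with dplyr in R; the Python below is only a language-neutral illustration that uses the standard library, and the CSV text, the password values and the helper name drop_blank_rows are all invented for the example rather than taken from the real data.

```python
import csv
import io

# Toy CSV standing in for the real file. All values are invented for
# illustration; the two trailing rows mimic fully blank observations.
RAW = """rank,password,category,strength
1,alpha,name,8
2,bravo1,simple-alphanumeric,4
,,,
,,,
"""


def drop_blank_rows(rows):
    """Keep only rows in which at least one field is non-empty."""
    return [
        row for row in rows
        if any((value or "").strip() for value in row.values())
    ]


rows = list(csv.DictReader(io.StringIO(RAW)))
cleaned = drop_blank_rows(rows)
print(len(rows), "rows read,", len(cleaned), "rows kept")
```

A row is dropped only when every field is empty, so an observation with a single missing value would be kept for separate inspection.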

3 Analysis and findings

3.2 How strong are the common passwords?

The strength of these common passwords is an interesting feature to explore because the strength variable is relative to the other passwords in the dataset. Since these are commonly used passwords, they are expected to be low in strength and easy to crack. The following analysis explores the dataset to determine how strong the passwords are. Throughout the analysis, the variable offline_crack_sec (the time taken to crack the password by offline guessing) is used instead of the variable value (the time taken to crack the password by online guessing). The two are proportional to each other, so comparisons between passwords give the same results with either.
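Comparing online guessing times directly would first require putting value and time_unit on a common scale. A minimal sketch of that conversion follows, written in Python purely for illustration. The unit labels in the lookup table and the 30-day month and 365-day year are assumptions, since the report does not list the units or say how they were defined.

```python
# Assumed unit labels and conversion factors; the report does not list the
# values of time_unit, so these names are hypothetical.
SECONDS_PER_UNIT = {
    "seconds": 1,
    "minutes": 60,
    "hours": 60 * 60,
    "days": 24 * 60 * 60,
    "weeks": 7 * 24 * 60 * 60,
    "months": 30 * 24 * 60 * 60,   # assumption: 30-day month
    "years": 365 * 24 * 60 * 60,   # assumption: 365-day year
}


def to_seconds(value, time_unit):
    """Convert an online guessing time to seconds using the table above."""
    return value * SECONDS_PER_UNIT[time_unit]


print(to_seconds(2, "minutes"))
print(to_seconds(1.5, "hours"))
```

Once both times are in seconds, dividing one by the other for each password would show whether the ratio is roughly constant, which is what proportionality implies.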

Figure 3.5: 43.6% of the passwords are relatively high in strength

In Figure 3.5 above, it can be seen that about 43.6% of the commonly used passwords have a relative strength between 8 and 10 on a scale of 1-10, with 10 being the highest quality among these passwords. A further 35.4% fall in the medium category, with relative strength between 6 and 8, whereas 9.2% have a weak strength of 4-6. Passwords in the very weak category, with strength 0-4, constitute around 8.8% of the total. Around 3% of the top 500 common passwords have a strength above 10, an interesting set of outliers because they fall outside the 1-10 strength scale set in the dataset description. Since these differ greatly from typical password strength, passwords with strength above 10 are excluded from the rest of the analysis.
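The banding described above can be sketched as follows. This is an illustration in Python rather than the R code behind the report. The band edges are read off the text, but which edge of each band is inclusive is an assumption (here each band includes its upper edge), and the strengths in toy are invented.

```python
from collections import Counter

# Upper edge and label of each band. Treating the upper edge as inclusive
# is an assumption; the report does not state how boundaries were handled.
BANDS = [(4, "very weak"), (6, "weak"), (8, "medium"), (10, "high")]


def strength_band(strength):
    """Return the band label for a relative strength value."""
    for upper, label in BANDS:
        if strength <= upper:
            return label
    return "above scale"


def band_shares(strengths):
    """Percentage of passwords falling in each band, rounded to 1 decimal."""
    counts = Counter(strength_band(s) for s in strengths)
    total = len(strengths)
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Invented strengths, for illustration only.
toy = [2, 5, 7, 7, 8, 9, 9, 10, 12, 48]
print(band_shares(toy))
```

Values above 10 are given their own label instead of being forced into the top band, which mirrors the decision to treat them as outliers.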

Another important way to judge a password is the time taken to crack it. To analyze this for the most popular passwords, the top 10 passwords by rank are considered. As seen in Figure 3.6, passwords like ‘1234’, ‘12345’, ‘123456’ and ‘12345678’ are so popular that they are cracked very easily, taking approximately 0 seconds. Given the argument that popularity makes passwords predictable and therefore easily cracked, it is interesting that ‘password’, despite being ranked first in popularity, sits among passwords like ‘football’ and ‘baseball’ that take relatively more time to crack.

Figure 3.6: As expected, 1234 is quick to be cracked

The passwords in the dataset belong to 10 different categories, such as simple-alphanumeric and animal. To find which types of passwords are the strongest, the analysis focuses on password strength and the time taken to crack them. To show the distribution of strength within each category, a density plot is drawn below in Figure 3.7 using the ggridges (2020) package. A median line is drawn to allow comparison of strength across the categories. It can be seen from the plot that password types such as names, sport, cool-macho and nerdy-pop are much higher in strength than the other categories, as 50% of the passwords in each of these categories have a strength higher than 8.

Figure 3.7: Password categories like ‘simple-alphanumeric’ have low strength compared to other categories
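The median lines in the ridgeline plot correspond to a per-category median of strength. A minimal sketch of that summary is given below in Python for illustration only; the category labels follow the report, but the strength values are invented and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median

# Invented records for illustration; the strength values do not come from
# the real data.
toy = [
    {"category": "sport", "strength": 8},
    {"category": "sport", "strength": 9},
    {"category": "sport", "strength": 7},
    {"category": "simple-alphanumeric", "strength": 1},
    {"category": "simple-alphanumeric", "strength": 4},
    {"category": "simple-alphanumeric", "strength": 2},
    {"category": "simple-alphanumeric", "strength": 8},
]


def median_strength_by_category(records):
    """Group records by category and return the median strength of each."""
    groups = defaultdict(list)
    for record in records:
        groups[record["category"]].append(record["strength"])
    return {category: median(values) for category, values in groups.items()}


medians = median_strength_by_category(toy)
# List categories from strongest to weakest median.
for category, value in sorted(medians.items(), key=lambda item: item[1], reverse=True):
    print(f"{category}: {value}")
```

The median is a natural choice here because a single unusually strong password in a category, such as the 8 in the toy simple-alphanumeric group, barely moves it.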

To gather further evidence on which types of passwords are the strongest, the time to crack the passwords is also evaluated. To do so, the mean of the variable ‘offline_crack_sec’ is plotted for each category in Figure 3.8 below. The figure shows an interesting pattern: ‘rebellious-rude’ passwords take the longest time on average to be hacked, even though their median strength is lower than that of password types like ‘nerdy-pop’ and ‘sport’. A similar pattern is seen in the ‘password-related’ type. Conversely, for types like ‘fluffy’, the average time to hack a password is quite low despite their high strength.

Figure 3.8: Password categories like ‘simple-alphanumeric’, ‘fluffy’ and ‘food’ are a few of the weak categories
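A ranking by mean crack time need not match a ranking by median strength, because a mean can be pulled up by a few slow-to-crack passwords. The sketch below, in Python for illustration, shows the two rankings disagreeing on invented data. The category labels A and B and all values are hypothetical, and whether this mechanism explains the pattern in Figure 3.8 is not established by the report.

```python
from collections import defaultdict
from statistics import mean, median

# Invented records; category B contains one password with a long crack time.
toy = [
    {"category": "A", "strength": 9, "offline_crack_sec": 2.0},
    {"category": "A", "strength": 8, "offline_crack_sec": 3.0},
    {"category": "A", "strength": 9, "offline_crack_sec": 1.0},
    {"category": "B", "strength": 6, "offline_crack_sec": 1.0},
    {"category": "B", "strength": 7, "offline_crack_sec": 2.0},
    {"category": "B", "strength": 6, "offline_crack_sec": 30.0},
]


def summarise(records):
    """Median strength and mean offline crack time for each category."""
    strengths, times = defaultdict(list), defaultdict(list)
    for record in records:
        strengths[record["category"]].append(record["strength"])
        times[record["category"]].append(record["offline_crack_sec"])
    return {
        category: {
            "median_strength": median(strengths[category]),
            "mean_crack_sec": mean(times[category]),
        }
        for category in strengths
    }


summary = summarise(toy)
by_strength = sorted(summary, key=lambda c: summary[c]["median_strength"], reverse=True)
by_time = sorted(summary, key=lambda c: summary[c]["mean_crack_sec"], reverse=True)
print("ranked by median strength:", by_strength)
print("ranked by mean crack time:", by_time)
```

Reporting the median crack time next to the mean would be one way to check how much a category's average depends on a handful of passwords.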

To answer the question of which password type is the strongest among these passwords, the ‘rebellious-rude’ and ‘cool-macho’ types are good contenders.

4 Conclusion

Through exploratory data analysis of the dataset on the top 500 commonly used passwords, it was observed that most people tend to choose passwords that can be easily remembered: simple passwords that are related to a name or contain alphanumeric characters and are roughly 6-7 characters long. On further exploration, it was found that 43.6% of the commonly used passwords are relatively high in strength and that around 3% of the passwords were of very high strength, which varied greatly from typical passwords.

Furthermore, it was observed that among the password categories, the ‘rebellious-rude’ and ‘cool-macho’ types are relatively strong and take relatively more time to be hacked. Another striking discovery is that hacking time and strength are not strictly related in this dataset. Not all passwords with high strength take long to crack, and not all passwords with low strength are cracked easily; there were instances of a high-strength password being hacked more quickly than a low-strength one.

It can be concluded that most people choose common passwords that can be easily hacked and that using any of the passwords in the dataset is not recommended.

5 References

[1] Winder, D. (2020). New Dark Web Audit Reveals 15 Billion Stolen Logins From 100,000 Breaches. Retrieved 18 August 2020, from https://www.forbes.com/sites/daveywinder/2020/07/08/new-dark-web-audit-reveals-15-billion-stolen-logins-from-100000-breaches-passwords-hackers-cybercrime/#344d620180fb

[2] Mock, T. (2020). rfordatascience/tidytuesday. Retrieved 16 August 2020, from https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-01-14

[3] McCandless, D. (2020). Knowledge is Beautiful, my new book - Information is Beautiful. Retrieved 18 August 2020, from http://www.informationisbeautiful.net/2014/knowledge-is-beautiful/

[4] Wood, R. (2020). Pipal, Password Analyser - DigiNinja. Retrieved 18 August 2020, from https://digi.ninja/projects/pipal.php

[5] Passwords - SkullSecurity. (2020). Retrieved 20 August 2020, from https://wiki.skullsecurity.org/Passwords

[6] Pie Charts. (2020). Retrieved 24 August 2020, from https://plotly.com/r/pie-charts/

[7] Elegant Visualization of Density Distribution in R Using Ridgeline - Datanovia. (2020). Retrieved 23 August 2020, from https://www.datanovia.com/en/blog/elegant-visualization-of-density-distribution-in-r-using-ridgeline/

[8] Claus O. Wilke (2020). ggridges: Ridgeline Plots in ‘ggplot2’. R package version 0.5.2. https://CRAN.R-project.org/package=ggridges

[9] Yihui Xie, Joe Cheng and Xianying Tan (2020). DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.15. https://CRAN.R-project.org/package=DT

[10] Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

[11] Tierney N (2017). “visdat: Visualising Whole Data Frames.” JOSS, 2(16), 355. https://doi.org/10.21105/joss.00355

[12] Nicholas Tierney, Di Cook, Miles McBain and Colin Fay (2020). naniar: Data Structures, Summaries, and Visualisations for Missing Data. R package version 0.5.2. https://CRAN.R-project.org/package=naniar

[13] Hadley Wickham, Jim Hester and Romain Francois (2018). readr: Read Rectangular Text Data. R package version 1.3.1. https://CRAN.R-project.org/package=readr

[14] Hao Zhu (2019). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0. https://CRAN.R-project.org/package=kableExtra

[15] Joe Cheng, Carson Sievert, Winston Chang, Yihui Xie and Jeff Allen (2020). htmltools: Tools for HTML. R package version 0.5.0. https://CRAN.R-project.org/package=htmltools

[16] Ian Fellows (2018). wordcloud: Word Clouds. R package version 2.6. https://CRAN.R-project.org/package=wordcloud

[17] Jeffrey B. Arnold (2019). ggthemes: Extra Themes, Scales and Geoms for ‘ggplot2’. R package version 4.2.0. https://CRAN.R-project.org/package=ggthemes

[18] C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC Florida, 2020.

[19] Joe Cheng (2020). crosstalk: Inter-Widget Interactivity for HTML Widgets. R package version 1.1.0.1. https://CRAN.R-project.org/package=crosstalk

[20] Garrick Aden-Buie (2020). ggpomological: Pomological plot themes for ggplot2. R package version 0.1.2. https://github.com/gadenbuie/ggpomological

[21] Hadley Wickham and Dana Seidel (2020). scales: Scale Functions for Visualization. R package version 1.1.1. https://CRAN.R-project.org/package=scales

[22] H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.